1.  Introduction


The problem of detecting edges in computer image processing
can be formulated as a problem of determining the correlation
between an n×n neighborhood of the image and a template containing
an ideal edge. This is a special case of the more general
problem of measuring the dependence between two random variables.
The Pearson product-moment (PPM) correlation is the most commonly
used measure, yet there are underlying assumptions that are
often neglected when PPM correlation is applied in practice.

Statisticians have developed many methods for the measurement
of dependence that do not have the same restrictions that
PPM does. A good introduction to several of these is Kruskal
[1]. Included in this survey is a discussion of the quadrant
measure of association. This measure is also presented in
Blomqvist [2], where it is referred to as the double median test
for association.

In Blomqvist [2] as well as Elandt [3], the double median
test is called a non-parametric test of tendency. Here we call
the double median test and its generalization distribution-free
tests. We have adopted the following convention as presented in
Gibbons [4]:

Definition 1: A distribution-free method is one based on functions
of sample observations whose corresponding random variables
have distributions which do not depend on the specific distribution
function of the population from which the sample was drawn.


Definition 2: A nonparametric test is a test for a hypothesis
which is not a statement about parameter values.

Definition 3: A parameter is a characteristic of the population.
A parameter can be thought of as an unspecified constant appearing
in a family of probability distributions, but it can also
be thought of as any characteristic of the population, properly
including those constants appearing in probability distributions.

In  this  paper  we  adopt  the  broader  definition  of  parameter. 

In  this  sense,  a  median  or  other  order  statistic  is  a  population 
parameter,  for  example.  Typically,  in  using  distribution-free 
methods,  the  underlying  assumption  is  that  the  population  is 
continuous.  We  make  that  assumption  here. 

We now discuss Blomqvist's double median test for association,
followed by its generalization, referred to as generalized
Blomqvist correlation (GBC). Some of the properties of GBC are
discussed, and it is shown that GBC is asymptotically more efficient
than Blomqvist's double median test.


2.  Blomqvist's  double  median  test 

As discussed in Section 1, the motivation behind distribution-free
methods is for these methods to be valid under
weak assumptions about the population distribution. In addition
to this, Blomqvist [2] also gave as a motivation the
notion that such tests should be easy to deal with in practice.
This aspect will be given further attention later in this
paper. The assumption made about the population is that the
cdf $F(x,y)$ has continuous marginal cdfs $F_1(x)$
and $F_2(y)$. This is done so that $\mathrm{Prob}\{x_i = x_j\} = 0$
and $\mathrm{Prob}\{y_i = y_j\} = 0$
for $1 \le i \le n$, $1 \le j \le n$ and $i \ne j$.

Let $(x_1,y_1), \ldots, (x_n,y_n)$ be a sample with such a cdf $F(x,y)$.

Denote by $\pi_s$ the probability

$$\pi_s = \mathrm{Prob}\{(x<x_0 \text{ and } y<y_0) \text{ or } (x>x_0 \text{ and } y>y_0)\} \tag{1}$$

for some $x_0$ and $y_0$. Similarly denote by $\pi_d$ the probability

$$\pi_d = \mathrm{Prob}\{(x<x_0 \text{ and } y>y_0) \text{ or } (x>x_0 \text{ and } y<y_0)\} \tag{2}$$

for the same $x_0$ and $y_0$. The above equations can be rewritten as

$$\pi_s = \mathrm{Prob}\{(x-x_0)(y-y_0) > 0\}, \qquad \pi_d = \mathrm{Prob}\{(x-x_0)(y-y_0) < 0\} \tag{3}$$

Here one can easily see that $\pi_s$ is the probability that the
deviations of x and y from $x_0$ and $y_0$ have the same sign, whereas
$\pi_d$ is the probability of different signs.

We define as a measure of correlation the difference in these
two probabilities. As in Kruskal [1], we denote this measure by

$$\gamma = \pi_s - \pi_d$$

Although the choice of $x_0$ and $y_0$ is arbitrary, we choose to
let $x_0 = x_m$, the median of x, and let $y_0 = y_m$, the median of y.
$\gamma$ will be 1 iff x and y deviate from their medians in the same
direction, and -1 iff x and y deviate from their medians in
opposite directions. Clearly if x and y are independent,
$\pi_s = \pi_d$ and so $\gamma$ is zero.

We construct the sample analog in the following way. The
xy plane is divided into four regions by the lines $x=x_m$ and
$y=y_m$ (see Figure 1). The four quadrants will be labelled by
the Roman numerals I, II, III, IV as follows:

Quadrant I: Those pairs $(x_i,y_i)$ such that $x_i<x_m$ and $y_i<y_m$.

Quadrant II: Those pairs $(x_i,y_i)$ such that $x_i>x_m$ and $y_i<y_m$.

Quadrant III: Those pairs $(x_i,y_i)$ such that $x_i>x_m$ and $y_i>y_m$.

Quadrant IV: Those pairs $(x_i,y_i)$ such that $x_i<x_m$ and $y_i>y_m$.

From the above definitions, it is clear that the number n of
samples used is even. Should the sample size be odd, then either
one or two samples fall exactly on the lines $x=x_m$ and $y=y_m$. If
there is one point (the point $(x_m,y_m)$) we do not count this. If
there are two points (the points $(x_i,y_m)$ and $(x_m,y_j)$ for some i
and j), we do not count one of them and we let the other one be
included in both regions it touches. If the sample size is even,
there is no problem and the medians are defined by

$$x_m = \left(x_{(n/2)} + x_{(n/2+1)}\right)/2$$

$$y_m = \left(y_{(n/2)} + y_{(n/2+1)}\right)/2$$

where $x_{(k)}$ and $y_{(k)}$ denote the k-th smallest x and y values
in the sample.

Let $n_1$ be the number of points in quadrants I and III. Similarly,
let $n_2$ be the number of points in quadrants II and IV. Clearly,
$n = n_1 + n_2$. We define as a measure of dependence between the two
random variables x and y the statistic q' where

$$q' = \frac{n_1 - n_2}{n_1 + n_2} \tag{4}$$


The  statistic  q'  lies  between  -1  and  1,  which  is  desirable  for  a 
measure  of  dependence. 
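
As an illustration of equation (4), the following is a minimal Python sketch of the quadrant count. The function name `blomqvist_q` is hypothetical, and the handling of points that fall exactly on a median line is a simplifying assumption: such points are skipped here, which is not the odd-sample rule described above.

```python
from statistics import median


def blomqvist_q(xs, ys):
    """Quadrant statistic q' = (n1 - n2) / (n1 + n2).

    Assumption of this sketch: a point lying exactly on the line
    x = x_m or y = y_m is skipped rather than shared between regions.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    xm, ym = median(xs), median(ys)
    n1 = n2 = 0
    for x, y in zip(xs, ys):
        sign = (x - xm) * (y - ym)
        if sign > 0:        # quadrants I and III: deviations agree in sign
            n1 += 1
        elif sign < 0:      # quadrants II and IV: deviations differ in sign
            n2 += 1
    if n1 + n2 == 0:
        raise ValueError("every point lies on a median line")
    return (n1 - n2) / (n1 + n2)
```

For an even-sized sample without ties no point lies on a median line, so the denominator equals the sample size.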

One of the earliest presentations of this statistic was in
Mosteller [5]. This presentation used two lines to partition the
xy plane in the x direction, and one in the y direction. The
double median test is the limiting case when these two lines coincide
at the median. An alternative definition of q' in terms of
U-statistics was given in Elandt [3], and this is the method used
in the next section. The estimate q' was given in Blomqvist [2]
along with its asymptotic distribution and asymptotic relative
efficiency (ARE). It was shown that the ARE of q' relative to
r (the estimate of the PPM correlation) is about 41%.

The interested reader is also directed to Bradley [6] for an
elementary discussion of this test and others based on order
statistics.


3.  A generalization of the double median test

The test statistic q' is computed by counting the number of
sample points in the four quadrants of the xy plane. The lines
which divide the xy plane are $x=x_m$ and $y=y_m$, where $x_m$ and $y_m$
are the x and y sample medians, respectively. Here we investigate
what happens when additional order statistics are introduced
to divide the xy plane into smaller regions.

Before proceeding it is necessary to introduce some notation.
Let N be the number of sample points, where N is assumed to be
even. If N is odd, it is made even by a procedure similar to
that described in Section 2 for the double median test. Let n
be the number of regions into which the x and y axes have been
divided. Hence the total number of regions in the xy plane is $n^2$.

We divide the x and y axes into n regions by introducing n-1
order statistics from the x and y samples, respectively. The
correlation scheme that uses n-1 order statistics is called
"Generalized Blomqvist Correlation of Order n."

Let $\xi_i^{(n)}$ denote the ith x order statistic from GBC of order n.
Similarly, let $\eta_i^{(n)}$ denote the ith y order statistic from the
same correlation scheme. Using this notation one can see that

$$\xi_1^{(2)} = x_m, \qquad \eta_1^{(2)} = y_m$$
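
The text does not say which n-1 order statistics are used as dividing lines. The sketch below assumes evenly spaced ones, placed at the midpoint between the two neighbouring order statistics so that the n=2 case reduces to the even-sample median of Section 2. The function name `cut_points` and the requirement that N be divisible by n are assumptions of this illustration.

```python
def cut_points(values, n):
    """Return n-1 dividing values that split `values` into n equal groups.

    Assumptions of this sketch: N = len(values) is divisible by n, and
    each cut is the midpoint of the two order statistics around it.
    """
    N = len(values)
    if n < 2 or N == 0 or N % n != 0:
        raise ValueError("this sketch needs n >= 2 and N divisible by n")
    s = sorted(values)
    step = N // n
    # For the i-th boundary, s[i*step - 1] and s[i*step] are the order
    # statistics immediately below and above it.
    return [(s[i * step - 1] + s[i * step]) / 2 for i in range(1, n)]
```

With n=2 the single cut returned is the sample median, matching the statement that the first order statistic of the order-2 scheme is the median.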


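Analogous to the definitions in Section 2, we denote by $\pi_s^{(n)}$ the
probability

$$\pi_s^{(n)} = \mathrm{Prob}\{(\xi_{i-1}^{(n)} < x < \xi_i^{(n)}) \text{ and } (\eta_{i-1}^{(n)} < y < \eta_i^{(n)})\} \tag{5}$$

for $1 \le i \le n$, and by $\pi_d^{(n)}$ the probability

$$\pi_d^{(n)} = \mathrm{Prob}\{(\xi_{i-1}^{(n)} < x < \xi_i^{(n)}) \text{ and } (\eta_{n-i}^{(n)} < y < \eta_{n-i+1}^{(n)})\} \tag{6}$$

for $1 \le i \le n$. $\xi_0^{(n)}$ and $\eta_0^{(n)}$ are taken to be $-\infty$, and
$\xi_n^{(n)}$ and $\eta_n^{(n)}$ are taken to be $+\infty$.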

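Note that in the double median test $\pi_s + \pi_d = 1$. Here
$\pi_s^{(n)} + \pi_d^{(n)} = 1$ iff n=2. Hence we define

$$\pi_o^{(n)} = \mathrm{Prob}\{x \notin (\xi_{i-1}^{(n)}, \xi_i^{(n)}) \text{ or } y \notin (\eta_{i-1}^{(n)}, \eta_i^{(n)}) \text{ or } y \notin (\eta_{n-i}^{(n)}, \eta_{n-i+1}^{(n)})\} \tag{7}$$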

$\pi_s^{(n)}$ is the probability that both x and y lie in the interval
bounded by their order statistics of the same rank. $\pi_d^{(n)}$ is the
probability that y lies in the interval bounded by order statistics
whose ranks are related to the ranks of the x order statistics
by the equation j=n-i+1. Here j is the rank of the y order statistic
and i is the rank of the x order statistic. $\pi_o^{(n)}$ is the probability
that x and y lie in other regions of the xy plane, i.e.,
$\pi_o^{(n)} = 1 - (\pi_s^{(n)} + \pi_d^{(n)})$.

We define as a measure of association $\gamma_{(n)}$, where $\gamma_{(n)}$ is
given by

$$\gamma_{(n)} = \pi_s^{(n)} - \pi_d^{(n)}$$

By considering the xy plane as an n×n matrix R, the probability
$\pi_s^{(n)}$ is the probability that the sample $(x_i,y_i)$ lies in the regions
along the main diagonal. $\pi_d^{(n)}$ is the probability that the sample
$(x_i,y_i)$ lies in the regions along the diagonal from the lower left
to the upper right corner. $\pi_o^{(n)}$ is the probability that $(x_i,y_i)$
falls in the other regions of the xy plane. If n=2, $\pi_o^{(2)} = 0$ since
there are no other regions.


The sample statistic $q'_{(n)}$ is computed in the following manner.
Denote by $r_{ij}^{(n)}$ the region of the xy plane such that

$$\xi_{i-1}^{(n)} < x < \xi_i^{(n)} \quad \text{and} \quad \eta_{j-1}^{(n)} < y < \eta_j^{(n)} \tag{9}$$


Let $|r_{ij}^{(n)}|$ denote the number of samples in the region $r_{ij}^{(n)}$. The
sample statistic $q'_{(n)}$ is computed by

$$q'_{(n)} = \frac{1}{N}\left(\sum_{i=1}^{n} |r_{ii}^{(n)}| - \sum_{i=1}^{n} |r_{i,n-i+1}^{(n)}|\right) \tag{10}$$


Let

$$n_1 = \sum_{i=1}^{n} |r_{ii}^{(n)}|, \qquad n_2 = \sum_{i=1}^{n} |r_{i,n-i+1}^{(n)}|$$

Then

$$q'_{(n)} = \frac{n_1 - n_2}{N} \tag{11}$$


If there is a positive correlation between x and y, then
$n_1 > n_2$ and $q'_{(n)}$ approaches +1 as the dependence strengthens.
If there is a negative correlation between x and y, then $n_2 > n_1$
and $q'_{(n)}$ approaches -1. If x and y are independent, then $n_1$
and $n_2$ are equal in expectation and $q'_{(n)}$ is close to 0.
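
The counting procedure for the order-n statistic can be sketched in Python as follows. The function name `gbc` is hypothetical, and the choice of dividing lines (evenly spaced midpoints between neighbouring order statistics, with N divisible by n) is an assumption of this sketch, since the text does not fix which n-1 order statistics are used. Cell indices are 0-based, so the anti-diagonal condition j=n-i+1 becomes `j == n - 1 - i`.

```python
from bisect import bisect_left


def gbc(xs, ys, n):
    """Sample GBC of order n: (n1 - n2) / N.

    n1 counts points in the diagonal cells of the n-by-n grid and n2
    counts points in the anti-diagonal cells. For odd n the centre cell
    lies on both diagonals and is counted in both sums.
    """
    N = len(xs)
    if N != len(ys) or n < 2 or N == 0 or N % n != 0:
        raise ValueError("need equal lengths, n >= 2 and N divisible by n")

    def cuts(values):
        # Assumed dividing lines: midpoints between neighbouring
        # order statistics at evenly spaced ranks.
        s = sorted(values)
        step = N // n
        return [(s[i * step - 1] + s[i * step]) / 2 for i in range(1, n)]

    xc, yc = cuts(xs), cuts(ys)
    n1 = n2 = 0
    for x, y in zip(xs, ys):
        i = bisect_left(xc, x)   # column of the cell containing x
        j = bisect_left(yc, y)   # row of the cell containing y
        if i == j:
            n1 += 1
        if j == n - 1 - i:
            n2 += 1
    return (n1 - n2) / N
```

Calling `gbc(xs, ys, 2)` on an even-sized sample without ties gives the same value as the double median statistic q'.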

We now examine an alternative formulation of the sample statistic
$q'_{(n)}$ in terms of U statistics as described in Hoeffding [7].
By showing that $q'_{(n)}$ is a U statistic, we can immediately determine
both the asymptotic distribution and the variance of $q'_{(n)}$.
From this, the ARE follows immediately.


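Hoeffding [7] shows that statistics of the form

$$U = \frac{1}{n(n-1)\cdots(n-m+1)} \sum{}'' \, \Phi(X_{\alpha_1}, \ldots, X_{\alpha_m})$$

(where the summation is over all permutations $\alpha_1$ to $\alpha_m$ of
m distinct integers from 1 to n) are unbiased estimators of their
population characteristics $\theta$. It is also shown that $\sqrt{n}\,(U-\theta)$
tends to a normal distribution as $n \to \infty$. It is desirable that the
statistic $q'_{(n)}$ have these properties, and hence we demonstrate a
U statistic formulation of $q'_{(n)}$.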

Randles and Wolfe [8] also provide a description of U statistics.
They do point out, however, the importance of the
estimability of the population parameter. Hence we state and
prove

Theorem 1: $\gamma_{(n)}$ is estimable of degree 1.

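Proof: To show that $\gamma_{(n)}$ is estimable of degree 1, we show that
there exists a function $\phi$ such that

$$E\{\phi(x_i, y_i)\} = \gamma_{(n)}$$

Let $Z_i$ be the sample $(x_i,y_i)$ and let $\phi(Z_i)$ be

$$\phi(Z_i) = \sum_{j=1}^{n} \sum_{k=1}^{n} \psi(x_i, y_i, j, k) \tag{13}$$

where

$$\psi(x_i, y_i, j, k) = \begin{cases} 1, & (j=k) \text{ and } \xi_{j-1}^{(n)} < x_i < \xi_j^{(n)},\ \eta_{k-1}^{(n)} < y_i < \eta_k^{(n)} \\ -1, & (k=n-j+1) \text{ and } \xi_{j-1}^{(n)} < x_i < \xi_j^{(n)},\ \eta_{k-1}^{(n)} < y_i < \eta_k^{(n)} \\ 0, & \text{otherwise} \end{cases} \tag{14}$$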

Now we must show that $E\{\phi(Z_i)\} = \gamma_{(n)}$ for all distributions
F in the family $\mathcal{F}$. It is clear that $\phi(Z_i)$ does not depend on
$F(\cdot)$ in any way. We are considering all distributions $F(\cdot)$ under
the null hypothesis $F(x,y) = F_1(x)F_2(y)$, that is, the hypothesis
that x and y are independent.

Under this hypothesis, $\gamma_{(n)} = 0$, so it suffices to show that
$E\{\phi(Z_i)\} = 0$. Under this hypothesis, all values of $\psi(x_i,y_i,j,k)$
are equally likely, so that

$$E\{\phi(Z_i)\} = E\Big\{\sum_{j=1}^{n}\sum_{k=1}^{n} \psi(x_i, y_i, j, k)\Big\} = \sum_{j=1}^{n}\sum_{k=1}^{n} E\{\psi(x_i, y_i, j, k)\} = 0$$

which is the desired result: $\gamma_{(n)}$ is estimable of degree 1.

Knowing that $\gamma_{(n)}$ is estimable of degree 1, we can construct
the U statistic

$$U = \frac{1}{N} \sum{}'' \, \phi(Z_i) \tag{15}$$

where $\phi(Z_i)$ is defined in equation (13), $\psi(x_i,y_i,j,k)$ is defined
in equation (14) and the sum $\sum''$ extends over all i such that $1 \le i \le N$.

By the results of Hoeffding [7], we know that U as given in
equation (15) is an unbiased estimate of $\gamma_{(n)}$. Also $\sqrt{N}\,(U-\gamma_{(n)})$
has a limiting normal distribution as $N \to \infty$.
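
To make the U statistic form concrete, the following Python sketch evaluates the kernel of equation (13) one sample at a time and averages it as in equation (15). The names `phi` and `u_statistic` are hypothetical, and the dividing values `xc` and `yc` (the n-1 interior cuts on each axis, in increasing order) are passed in as fixed inputs, which is an assumption of this illustration. Cell indices are 0-based, so the anti-diagonal is `k == n - 1 - j`.

```python
from bisect import bisect_left


def phi(x, y, xc, yc):
    """Kernel of equation (13): psi summed over all n*n cells (j, k).

    psi is nonzero only for the one cell that contains (x, y); it adds
    +1 if that cell is on the main diagonal and -1 if it is on the
    anti-diagonal (both apply to the centre cell when n is odd).
    """
    n = len(xc) + 1
    col = bisect_left(xc, x)
    row = bisect_left(yc, y)
    total = 0
    for j in range(n):
        for k in range(n):
            if (j, k) != (col, row):
                continue            # psi = 0 for cells not containing the point
            if j == k:
                total += 1
            if k == n - 1 - j:
                total -= 1
    return total


def u_statistic(xs, ys, xc, yc):
    """Average of the degree-1 kernel over the N samples, as in equation (15)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum(phi(x, y, xc, yc) for x, y in zip(xs, ys)) / len(xs)
```

Because each sample contributes +1, -1 or 0 depending only on the cell it lands in, the average equals the count-based form $(n_1 - n_2)/N$ for the same dividing values.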

Now it is of interest to determine the variance of the statistic
U. With this in hand, the ARE of U relative to q' can be
computed.


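The variance of U is computed by the following equation
from Randles and Wolfe [8]:

$$\mathrm{Var}(U) = \frac{1}{\binom{N}{r}} \sum_{c=1}^{r} \binom{r}{c} \binom{N-r}{r-c} \zeta_c$$

Since the estimability r is 1 here,

$$\mathrm{Var}(U) = \frac{1}{N}\,\zeta_1 \tag{16}$$

where $\zeta_1$ is given by

$$\zeta_1 = E\{\phi^2(Z_i)\} - \gamma_{(n)}^2 \tag{17}$$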

Again we compute this expectation under the null hypothesis
of independence. The values of $\psi(x_i,y_i,j,k)$ are equally likely
and therefore

$$E\{\phi^2(Z_i)\} = E\Big\{\Big(\sum_{j=1}^{n}\sum_{k=1}^{n} \psi(x_i, y_i, j, k)\Big)^2\Big\} = E\Big\{\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{j'=1}^{n}\sum_{k'=1}^{n} \psi(x_i, y_i, j, k)\,\psi(x_i, y_i, j', k')\Big\}$$

The product is 1 for j=j' and k=k' and otherwise 0. Since 1 is
twice as likely as 0 ($\psi$ takes on -1, 0, or 1), the expectation
can now be solved:

$$E\{\phi^2(Z_i)\} = E\Big\{\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{j'=1}^{n}\sum_{k'=1}^{n} \psi(x_i, y_i, j, k)\,\psi(x_i, y_i, j', k')\Big\} = \frac{4}{3n} \tag{18}$$

Hence equation (17) becomes

$$\zeta_1 = E\{\phi^2(Z_i)\} - \gamma_{(n)}^2 = \frac{4}{3n} - \gamma_{(n)}^2 \tag{19}$$


Put equation (19) into equation (16) to obtain the variance
of the statistic U:

$$\mathrm{Var}(U) = \frac{1}{N}\,\zeta_1 = \frac{1}{N}\left(\frac{4}{3n} - \gamma_{(n)}^2\right) \tag{20}$$

This result is used in the next section to compute ARE(U,q').


5.  The  Asymptotic  Relative  Efficiency  of  U 

The asymptotic relative efficiency of two statistical tests
is an indication of the relative power of the tests. That is to
say, it is a measure of the ratio of the sample sizes the two
tests require to achieve the same level of statistical significance.
The interested reader is referred to Gibbons [4] for a
detailed explanation and several worked out examples.

Here we will investigate the ARE of U relative to q'. We
would like to see if ARE(U,q') is at least 1 for all n. The ARE
is defined by

$$\mathrm{ARE}(U,q') = \lim_{N \to \infty} \frac{e(U)}{e(q')} \tag{21}$$

where $e(\cdot)$ is the efficacy of the test statistic:

$$e(T) = \left.\frac{[dE(T)/d\theta]^2}{\sigma^2(T)}\right|_{\theta=\theta_0} \tag{22}$$

E(T) is the expected value of the test statistic T, and $\theta$ is the
population parameter. In both cases $dE(T)/d\theta$ is 1 since $E(q') = \gamma$
and $E(U) = \gamma_{(n)}$. Thus, equation (21) becomes

$$\mathrm{ARE}(U,q') = \lim_{N \to \infty} \frac{\sigma^2(q')}{\sigma^2(U)} \tag{23}$$


It is also necessary to indicate the hypotheses being tested,
since in the process of taking the limit we are also letting the
alternate hypothesis approach the null hypothesis. As previously
indicated in this paper our hypotheses are

null hypothesis: $\gamma = 0$

alternate hypothesis: $\gamma \ne 0$


From Blomqvist [2], the variance of q' is $4\alpha_0(1-2\alpha_0)/k$. From
equation (20) the variance of U is $\frac{1}{N}\left(\frac{4}{3n} - \gamma_{(n)}^2\right)$. Note that
the k in the expression of var(q') is equal to N/2 from the
definitions in Blomqvist [2]. We now compute ARE(U,q'):


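$$\mathrm{ARE}(U,q') = \lim_{N \to \infty} \frac{\sigma^2(q')}{\sigma^2(U)} = \lim_{N \to \infty} \frac{4\alpha_0(1-2\alpha_0)}{k} \Big/ \frac{1}{N}\left(\frac{4}{3n} - \gamma_{(n)}^2\right) = \lim_{N \to \infty} \frac{4\alpha_0(1-2\alpha_0)}{k} \cdot \frac{3Nn}{4 - 3n\gamma_{(n)}^2}$$

Since N=2k,

$$\mathrm{ARE}(U,q') = \lim_{N \to \infty} \frac{8\alpha_0(1-2\alpha_0)\,3n}{4 - 3n\gamma_{(n)}^2}$$

Now taking the limit, $\gamma_{(n)}^2 = 0$, $\alpha_0 = \frac{1}{4}$, and

$$\mathrm{ARE}(U,q') = \frac{3n}{4}$$

Hence, ARE(U,q') is greater than 1 for all $n \ge 2$.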
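
The substitution in the last step can be checked with exact arithmetic. The sketch below assumes the expression obtained after setting N=2k reads $8\alpha_0(1-2\alpha_0)\cdot 3n / (4 - 3n\gamma_{(n)}^2)$ and simply evaluates it; the function name `are_u_vs_q` and its default arguments are illustrative choices, with the defaults set to the null-hypothesis values $\alpha_0 = 1/4$ and $\gamma_{(n)}^2 = 0$.

```python
from fractions import Fraction


def are_u_vs_q(n, alpha0=Fraction(1, 4), gamma_sq=Fraction(0)):
    """Evaluate 8*alpha0*(1 - 2*alpha0)*3n / (4 - 3n*gamma_sq) exactly.

    This only performs the arithmetic of the expression assumed above;
    it does not re-derive the variances that lead to it.
    """
    numerator = 8 * alpha0 * (1 - 2 * alpha0) * 3 * n
    denominator = 4 - 3 * n * gamma_sq
    return Fraction(numerator) / Fraction(denominator)
```

With the default values the numerator reduces to 3n and the denominator to 4, so the function returns 3n/4 for any order n.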


6.  Conclusion 


It  has  been  shown  that  by  a  suitable  extension  of  the  double 
median  test  for  trend,  a  more  efficient  test  results,  relative 
to  the  original  test.  This  more  efficient  test  has  been  referred 
to  as  Generalized  Blomqvist  Correlation  of  order  n,  where  n  is 
the  number  of  regions  into  which  the  x  and  y  axes  of  the  xy  plane 
have  been  divided  by  n-1  order  statistics  from  each  sample. 

There is a tradeoff, though, in using GBC. Although GBC is
more efficient in a statistical sense, it involves more computation
than does the double median test. Both tests have been used
as part of a computer program for analyzing images. The task
of computing either correlation coefficient was divided into two
parts. First, the sample order statistics were computed, and in
the second part, the N samples were classified into the appropriate
regions of the xy plane, and the coefficient of correlation was computed.

The algorithm used for the first part can be found in Aho et
al. [9]. Figure 2 gives a comparison of the asymptotic time
bounds for the execution of each half of the computation of the
correlation coefficient. All logarithms are to the base 2. In
computing q', one can easily see that this is simply computing the
statistic U when n=2. Hence the computational complexity increases
linearly with n.
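
The algorithm from Aho et al. [9] is not reproduced in this paper. As a stand-in for the first part of the computation, the sketch below uses randomized quickselect, a different but common technique that finds a single order statistic in expected linear time without sorting the whole sample. The function name `kth_smallest` is hypothetical.

```python
import random


def kth_smallest(values, k):
    """Return the k-th smallest element of `values` (k is 1-based).

    Randomized quickselect: partition around a random pivot and keep
    only the part that must contain the requested order statistic.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [v for v in items if v < pivot]
        n_equal = sum(1 for v in items if v == pivot)
        if k <= len(lows):
            items = lows                      # answer is below the pivot
        elif k <= len(lows) + n_equal:
            return pivot                      # answer is the pivot itself
        else:
            k -= len(lows) + n_equal          # answer is above the pivot
            items = [v for v in items if v > pivot]
```

Each of the n-1 dividing order statistics on an axis could be obtained by one such call, after which the N samples are classified into regions in a single pass.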

Further work in this area might be concerned with determining
if GBC is more efficient than similar parametric tests. This
question has not been addressed in this paper. Another generalization
investigated could be similar to that of Mosteller [5] in
letting the order statistics move and choosing those order statistics
that give optimal results for some appropriate criteria.
Also not addressed was the case where a different number of statistics
from the y sample is chosen, say m, so that the division
of the xy plane is not into $n^2$ regions but nm where $n \ne m$. This,
and other interesting questions concerning distribution-free
statistics, remain to be answered.
Figure 1.  xy plane divided by the lines $x=x_m$ and $y=y_m$
Figure 2.  Asymptotic time bounds for computing q' or